version 13 // indicates version
cls // clears backscroll display buffer
clear all // "start from scratch"
set more off // suppresses -more- prompts

/* load background variables */
cd "C:\Users\benja\Dropbox\BenjaminWork\ResearchProject\data\creation\data-sets and merge"
use d01_bv.dta 

/* create indicator for observations from the first wave(s) */
capture drop firstwave // only exists if this do-file has been run on d01_bv.dta before
bys id (wave): gen firstwave = (wave[1] == 7) | (wave[1] == 8) 
// i.e. looking at the waves of each id separately, the first observed wave has to be either 7 or 8
label var firstwave "Belongs to the original sample gathered 2007/08"
save d01_bv.dta, replace

/* 
  creating a balanced panel
  i.e. for each id we have one row per wave:

	id wave age
	1  7    .
	1  8    .
	1  9    .
	1  10   54
	1  11   55
	1  12   56
	1  13   57
	1  14   58
	1  15   59

  even though, as in this example, the respondent may have entered the panel only in wave 10
*/
bysort id: keep if _n==1
keep id
save idlist.dta, replace
clear
set obs 9 
gen wave = _n + 6
cross using idlist.dta // all pairwise combinations of wave and id
merge 1:1 id wave using d01_bv.dta // merge the real data
erase idlist.dta
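
// Optional check, a minimal sketch that is not part of the original workflow: the skeleton
// should now hold exactly one row per id and wave. In the table, _merge == 1 marks skeleton
// rows without background data and _merge == 3 rows that were filled from d01_bv.dta.
isid id wave
tabulate wave _merge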

/* sequential 1:1 merging with ID WAVE as unique identifier */
merge 1:1 id wave using d02_nfc.dta, gen(_merge02)
merge 1:1 id wave using d03_mood.dta, gen(_merge03)
merge 1:1 id wave using d04_lonely.dta, gen(_merge04)
merge 1:1 id wave using d05_leisure.dta, gen(_merge05)
merge 1:1 id wave using d06_polint.dta, gen(_merge06)
merge 1:1 id wave using d07_polinst.dta, gen(_merge07)
merge 1:1 id wave using d08_b5.dta, gen(_merge08)
merge 1:1 id wave using d09_st.dta, gen(_merge09) 
merge 1:1 id wave using d10_vote.dta, gen(_merge10)
merge 1:1 id wave using d11_sas.dta, gen(_merge11)
merge 1:1 id wave using d12_part.dta, gen(_merge12) // here some IDs do not turn up in the balanced panel, which is unexpected; most of them have missing values on the participation variable
drop if _merge12 == 2

/*
  discussion:
  As the balanced panel is built from a first master file (background variables) that contains
  all IDs there are in the LISS, IDs coming from the using data sets (i.e. those that are added)
  should always find a match in the master file. And indeed _mergeXX == 2 is (almost) never
  the case. We do have cases where _mergeXX == 1, meaning that IDs that exist in the master
  file do not exist in the using data sets. This is normal, as not everyone has answered all
  questionnaires in all waves.
*/
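
// A minimal sketch for checking the claim above, assuming the indicators _merge02 to _merge12
// created by the merges are still in memory: count the rows that exist only in the using
// data (== 2) for each indicator. Nothing is changed in the data.
foreach m of varlist _merge?? {
	quietly count if `m' == 2
	display as text "`m': " as result r(N) as text " rows found only in the using data"
}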

/* drop the merging information */
drop _merge* 

/* 
  For wave 7 we only have background information. Hence, we cannot 
  draw any conclusions from this wave. Survey attitude scale for instance  
  is missing for everyone. 
*/
// drop if wave == 7 | wave > 11
drop if wave == 7

/* 
  Drop observations that do not belong to the original sample.
  Reason: if we only use the first cohort, we can disentangle the cohort effect from the
  effect of length of time, because we can assume that not incorporating the respondents
  who entered in later waves is MCAR (missing completely at random).
*/
drop if firstwave == 0

/* Delete rows that are completely missing */
dropmiss simpc_hh - dropout, obs force // missing values on all variables
dropmiss dropout part, obs force // missing values on the dependent variables
bys id (wave): gen nobs = _N
dropmiss simpc_hh - part if nobs == 1, obs force // missing values on all variables besides dropout but only 1 participation
drop nobs

// to use vce(cluster id_hh), we need to ensure that every id is indeed nested in one id_hh
// - this might not be the case because of split-offs; unfortunately the recruit variable is
//   not comprehensive in that respect: some respondents who are not marked as split-offs
//   actually are split-offs
// - plus, we have missing values in id_hh
// --> respondents that exhibit either problem have to be excluded

// everyone gets their first household id (sorted by id_hh, i.e. the smallest non-missing one):
bys id (id_hh): replace id_hh = id_hh[1] 

// number of ids per id_hh:
bys id_hh id: gen nvals = (_n == 1)
bys id_hh (id): regen nvals = total(nvals), replace

drop if nvals > 10 // ids belonging to the household "missing value" are excluded

// number of id_hh per id:
bys id id_hh: gen nvals2 = (_n == 1)
bys id (id_hh): regen nvals2 = total(nvals2), replace

drop if nvals2 > 1 

drop nvals*
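
// Sketch of a check that the nesting needed for vce(cluster id_hh) holds at this point, i.e.
// that no id is spread over more than one id_hh. It only restates the logic above and changes
// nothing except the sort order; the count shows how many rows still lack a household id.
bys id (id_hh): assert id_hh[1] == id_hh[_N]
quietly count if missing(id_hh)
display as text "rows with missing id_hh: " as result r(N)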

//////////////////////////////////////////////////////////////////////////////////////////

/* 
  Imputations 
  Unfortunately the background variables data set is a bit fragmented. 
  For instance, we have cases that look like this:
  
  id wave age 
  1    8   . 
  1    9  54 
  1   10   . 
  
  Hence, for some waves, information that can be deduced from other waves is missing.
  There are several methods to deal with this. The best method also depends on the kind
  of variable. I would not impute variables other than background variables, as variables
  like NEED FOR COGNITION should be treated as genuinely random variables that we cannot
  deduce beforehand. The following variables, however, are candidates for imputation
  to complete the data set where information is missing in some waves:

  ------------------------------------------------------------------
  (1) variables should not vary at all, or we know their development:
  ------------------------------------------------------------------
      migrantd, birthyear (age), female, simpc_hh, recruit, id_hh,
      hhtype (i.e. hhtype_sel hhtype_ren), hhurban, position (i.e.
      pos_part pos_child)

  ------------------------------------------------------------------
  (2) we expect very little variation:
  ------------------------------------------------------------------
      partner_hhh, age_hhh, livstatus_hhh, hhsize, hhchildren,
      edu, civilstatus (i.e. civil_marr civil_wid civil_nev civil_div),
      occ (i.e. occ_emp occ_seek occ_stu occ_soc occ_home occ_pen)

  ------------------------------------------------------------------
  (3) variation is expected:
  ------------------------------------------------------------------
      income_hh, income

  In case (1), it is clear that we can carry values both backwards and forwards.
  In case (2), I would recommend imputing only in between valid observations
               that have the same value.
  In case (3), I would impute missings by linear interpolation
               in between valid observations.
  
*/

///////////////////////////////////
/* (1) carry back- and forwards */
/////////////////////////////////

// good example ID = 800744 //

// id wave age replacement
//  1    8   . 53
//  1    9  54 54
// --> carry backwards
gsort +id -wave
foreach v of varlist migrantd birthyear female simpc_hh {
	bys id: replace `v' = `v'[_n-1] if (`v' == .)
}
bys id: replace age = age[_n-1] - 1 if (age == .) // one wave earlier means one year younger, see example above
// household variables: split off hh should not be considered
foreach v of varlist recruit id_hh hhtype hhtype_sel hhtype_ren hhurban position pos_part pos_child {
	bys id: replace `v' = `v'[_n-1] if (`v' == . & recruit[_n-1] != 99)
}
 
// id wave age replacement
//  1    8  54 54
//  1    9   . 55
// --> carry forwards
gsort +id +wave
foreach v of varlist migrantd birthyear female simpc_hh {
	bys id: replace `v' = `v'[_n-1] if (`v' == .)
}
bys id: replace age = age[_n-1] + 1 if (age == .) // one wave later means one year older, see example above
// household variables: split off hh should not be considered
foreach v of varlist recruit id_hh hhtype hhtype_sel hhtype_ren hhurban position pos_part pos_child {
	bys id: replace `v' = `v'[_n-1] if (`v' == . & recruit[_n-1] != 99)
}

/* birthyear, female, age and voted are not constant within id, taking the mode... */
bys id (wave): regen female = mode(female), maxmode replace
bys id (wave): regen birthyear = mode(birthyear), maxmode replace
bys id (wave): regen age = mode(age), maxmode replace
bys id (wave): regen voted = mode(voted), maxmode replace
// here //


////////////////////////////////////////////////////////////////////
/* (2) imputation in between valid observations (carry forwards) */
//////////////////////////////////////////////////////////////////

// good example ID = 891723 //

// partner_hhh, livstatus_hhh, hhsize, hhchildren, edu, civilstatus (+ dummies), occ_cat (+ dummies) //
foreach v of varlist partner_hhh livstatus_hhh hhsize hhchildren edu ///
		civilstatus civil_marr civil_wid civil_nev civil_div ///
		occ_cat occ_emp occ_seek occ_stu occ_soc occ_home occ_pen {
	bysort id (wave): gen first = sum(`v'<.)==1 & sum(`v'[_n-1]<.)==0 // first valid
	gsort id -wave
	by id: gen last = sum(`v'<.)==1 & sum(`v'[_n-1]<.)==0 // last valid
	bysort id (wave): gen spell = sum(first)-sum(last[_n-1]) // spell "being in panel"
	* Filling in the value from before
	bysort id spell (wave): replace `v' = `v'[_n-1] if `v'==. & spell==1
	drop first last spell
}

// age_hhh // (uses the already imputed partner_hhh)
bysort id (wave): gen first = sum(age_hhh<.)==1 & sum(age_hhh[_n-1]<.)==0 // first valid
gsort id -wave
by id: gen last = sum(age_hhh<.)==1 & sum(age_hhh[_n-1]<.)==0 // last valid
bysort id (wave): gen spell = sum(first)-sum(last[_n-1]) // spell "being in panel"
* Filling in the value from before, plus one year per wave
bysort id spell (wave): replace age_hhh=age_hhh[_n-1] + 1 if age_hhh==. & spell==1 & partner_hhh != 0
drop first last spell
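
// Illustration on toy data (hypothetical values, not LISS data), a sketch only: the
// first/last/spell logic used above fills gaps in between valid observations and leaves
// missings before the first and after the last valid observation untouched.
// preserve/restore keeps the real data in memory unchanged.
preserve
clear
input id wave edu
1  8 .
1  9 3
1 10 .
1 11 4
1 12 .
end
bysort id (wave): gen first = sum(edu<.)==1 & sum(edu[_n-1]<.)==0 // first valid (wave 9)
gsort id -wave
by id: gen last = sum(edu<.)==1 & sum(edu[_n-1]<.)==0 // last valid (wave 11)
bysort id (wave): gen spell = sum(first)-sum(last[_n-1]) // 1 for waves 9 to 11, else 0
bysort id spell (wave): replace edu = edu[_n-1] if edu==. & spell==1
assert edu == 3 if wave == 10 // gap inside the spell is filled
assert missing(edu) if inlist(wave, 8, 12) // outside the spell nothing is imputed
restore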


///////////////////////////////////////////////////////////////////////////////////////
/* (3) Imputation of missings by linear interpolation in between valid observations */
/////////////////////////////////////////////////////////////////////////////////////

// income_hh, income //
// For income this is a bit dangerous, as there may be many missings because the occupation
// is something like "Is too young to have an occupation".
// However, this should rarely meet the requirement that the missings fall in between
// valid observations...
foreach v of varlist income_hh income {
	bysort id (wave): gen first = sum(`v'<.)==1 & sum(`v'[_n-1]<.)==0 // first valid
	gsort id -wave
	by id: gen last = sum(`v'<.)==1 & sum(`v'[_n-1]<.)==0 // last valid
	bysort id (wave): gen spell = sum(first)-sum(last[_n-1]) // indicator for spell "being in panel"
	gen spellm = 1 if spell==1 & `v'==. // indicator for missing spell (MS)
	bysort id (wave): gen gap = sum(spellm==1 & spellm[_n-1]!=1) if spellm==1 // running number of the MS within id
	bysort id gap: gen lspellm = _N if spellm==1 // length of this MS
	bysort id gap (wave): gen nrspellm = _n if spellm==1 // numbering the person-years of this MS
	bysort id (wave): gen incb = `v'[_n-1] if spellm==1 & spellm[_n-1]==. // last before MS
	gsort id -wave
	by id: gen inca = `v'[_n-1] if spellm==1 & spellm[_n-1]==. // first after MS
	bysort id gap (incb): replace incb = incb[1] if spellm==1 // filling up incb within this MS
	bysort id gap (inca): replace inca = inca[1] if spellm==1 // filling up inca within this MS
	sort id wave
	replace `v' = incb + nrspellm * ((inca-incb)/(lspellm+1)) if spellm==1 // imputing missing
	drop first last spell spellm gap lspellm nrspellm incb inca
}
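
// Illustration on toy data (hypothetical values, not LISS data), a sketch only: for equally
// spaced waves the formula incb + nrspellm * ((inca-incb)/(lspellm+1)) gives the same values
// as Stata's built-in -ipolate-. Here incb = 100, inca = 400 and lspellm = 2; the names
// inc and inc_ip are made up for this example. preserve/restore keeps the real data unchanged.
preserve
clear
input id wave inc
1  8 100
1  9   .
1 10   .
1 11 400
end
bysort id: ipolate inc wave, generate(inc_ip) // linear interpolation of inc on wave within id
assert abs(inc_ip - (100 + 1 * ((400-100)/(2+1)))) < 1e-6 if wave == 9
assert abs(inc_ip - (100 + 2 * ((400-100)/(2+1)))) < 1e-6 if wave == 10
restore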

////////////////////////////////////////
/* fraction of missings by respondent */
////////////////////////////////////////

global X "simpc_hh hhsize hhurban income_hh age female edu migrantd trust voted opcosts agreeablenessscore joyscore valscore burscore" // short notation

egen frm = rowmiss($X)
local nX : word count $X // number of predictors in the list (15)
replace frm = frm / `nX'
bys id (wave): regen frm = mean(frm), replace
label var frm "fraction of missing values of all used predictors per respondent before MI"
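
// Plausibility check, a minimal sketch: frm is a respondent-level mean of per-wave fractions
// of missing predictors, so it should lie in [0, 1]. The summary is for inspection only.
assert inrange(frm, 0, 1)
summarize frm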


/////////////////////////////////////////////////////////////////////////////////////////////////////
/// END /////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////

/* check for duplicate entries */
bys id wave: gen N = _N
assert N == 1
drop N // no duplicate entries

saveold finaldata, replace
   
